Abstract. This paper proposes a new approach to detecting grasp points
on novel objects presented in clutter. The input to our algorithm is
a point cloud and the geometric parameters of the robot hand. The
output is a set of hand configurations that are expected to be good
grasps. Our key idea is to use knowledge of the geometry of a good grasp
to improve detection. First, we use a geometrically necessary condition
to sample a large set of high-quality grasp hypotheses. We were
surprised to find that using simple geometric conditions for detection can
result in a relatively high grasp success rate. Second, we use the notion
of an antipodal grasp (a standard characterization of a good two-fingered
grasp) to help us classify these grasp hypotheses. In particular,
we generate a large automatically labeled training set that gives us high
classification accuracy. Overall, our method achieves an average grasp
success rate of 88% when grasping novel objects presented in isolation
and an average success rate of 73% when grasping novel objects
presented in dense clutter. This system is available as a ROS package at
http://wiki.ros.org/agile_grasp


1 Introduction 


Traditionally, robot grasping is understood in terms of two related subproblems:
perception and planning. The goal of the perceptual component is to estimate
the position and orientation (pose) of an object to be grasped. Then, grasp and
motion planners are used to calculate where to move the robot arm and hand
in order to perform the grasp. While this approach can work in ideal scenarios,
it has proven to be surprisingly difficult to localize the pose of novel objects
in clutter accurately [5]. More recently, researchers have proposed various grasp
point detection methods that localize grasps independently of object identity.
One class of approaches uses a sliding window to detect regions of an RGBD
image or a height map where a grasp is likely to succeed [16,7,3,4,12,9]. Other
approaches extrapolate local “grasp prototypes” based on human-provided grasp
demonstrations.

A missing element in the above works is that they do not leverage the
geometry of grasping to improve detection. Grasp geometry has been studied
extensively in the literature (for example [13,17]). Moreover, point clouds created
using depth sensors would seem to be well suited for geometric reasoning. In this
paper, we propose an algorithm that detects grasps in a point cloud by predicting
the presence of necessary and sufficient geometric conditions for grasping. The
algorithm has two steps. First, we sample a large set of grasp hypotheses. Then,
we classify those hypotheses as grasps or not using machine learning. Geometric
information is used in both steps. First, we use geometry to reduce the size of
the sample space. A trivial necessary condition for a grasp to exist is that the
hand must be collision-free and part of the object surface must be contained
between the two fingers. We propose a sampling method that only produces
hypotheses that satisfy this condition. This simple step should boost detection
accuracy relative to approaches that consider every possible hand placement a
valid hypothesis. The second way that our algorithm uses geometric information
is to automatically label the training set. A necessary and sufficient condition
for a two-finger grasp is an antipodal contact configuration (see Definition 1).
Unfortunately, we cannot reliably detect an antipodal configuration in most
real-world point clouds because of occlusions. However, it is nevertheless
sometimes possible to verify a grasp using this condition. We use the antipodal
condition to label a subset of grasp hypotheses in arbitrary point clouds
containing ordinary graspable objects. We generate large amounts of training data
this way because it is relatively easy to take many range images of ordinary
objects. This is a significant advantage relative to approaches that depend on
human annotations because large amounts of training data can significantly
improve classification performance.

Our experiments indicate that the approach described above performs well
in practice. We find that without using any machine learning and just using our
collision-free sampling algorithm as a grasp detection method, we achieve a 73%
grasp success rate for novel objects. This is remarkable because this is a trivially
simple detection criterion. When a classification step is added to the process,
our grasp success rate rises to 88%. This success rate is competitive with the
best results that have been reported. However, what is particularly interesting is
the fact that our algorithm achieves an average 73% grasp success rate in dense
clutter such as that shown in Figure 1. This is exciting because dense clutter is
a worst-case scenario for grasping. Clutter creates many occlusions that make
perception more difficult and obstacles that make reaching and grasping harder.

Fig. 1. Our algorithm is able to localize and grasp novel objects in dense clutter.


1.1 Related Work 

The idea of searching an image for grasp targets independently of object identity
was probably explored first in Saxena’s early work that used a sliding window
classifier to localize good grasps based on a broad collection of local visual
features. Later work extended this concept to range data [7] and explored a
deep learning approach [12]. In [12], the authors obtain an 84% success rate on
Baxter and a 92% success rate on the PR2 for objects presented in isolation
(averaged over 100 trials). Fischinger and Vincze developed a similar method that
uses heightmaps instead of range images and develops a different Haar-like feature
representation [3,4]. In [4], they report a 92% single-object grasp success rate
averaged over 50 grasp trials using the PR2. This work is particularly interesting
because they demonstrate clutter results where the robot grasps and removes
up to 10 piled objects from a box. They report that over six clear-the-box runs,
their algorithm removes an average of 87% of the objects from the box. Other
approaches search a range image or point cloud for hand-coded geometries that
are expected to be associated with a good grasp. For example, Klingbeil et al.
search a range image for a gripper-shaped pattern [9]. In our prior work, we
developed an approach to localizing handles by searching a point cloud for a
cylindrical shell. Other approaches follow a template-based approach where
grasps that are demonstrated on a set of training objects are generalized to new
objects. For example, Herzog et al. learn to select a grasp template from a
library based on features of the novel object [6]. Detry et al. grasp novel objects
by modeling the geometry of local object shapes and fitting these shapes to new
objects [2]. Kroemer et al. propose an object affordance learning strategy where
the system learns to match shape templates against various actions afforded by
those templates. Another class of approaches worth mentioning is based on
interacting with a stack of objects. For example, Katz et al. developed a method
of grasping novel objects based on interactively pushing the objects in order to
improve object segmentation [8]. Chang et al. developed a method of segmenting
objects by physically manipulating them [1]. The approach presented in this
paper is distinguished from the above primarily because of the way we use
geometric information. Our use of geometry to generate grasp hypotheses is novel.
Moreover, our ability to generate large amounts of labeled training data could be
very important for improving detection accuracy in the future. However, what
is perhaps most important is that we demonstrate “reasonable” (73%) grasp
success rates in dense clutter - arguably a worst-case scenario for grasping.

2 Approach 

We frame the problem of localizing grasp targets in terms of locating antipodal
hands, an idea that we introduce based on the concept of an antipodal grasp. In
an antipodal grasp, the robot hand is able to apply opposite and co-linear forces
at two points:

Definition 1 (Nguyen [14]). A pair of point contacts with friction is antipodal
if and only if the line connecting the contact points lies inside both friction
cones.^1

^1 A friction cone describes the space of normal and frictional forces that a point
contact with friction can apply to the contacted surface.

If an antipodal grasp exists, then the robot can hold the object by applying
sufficiently large forces along the line connecting the two contact points. In this
paper, we restrict consideration to parallel jaw grippers - hands with parallel
fingers and a single closing degree of freedom. Since a parallel jaw gripper can only
apply forces along the (single) direction of gripper motion, we will additionally
require the two contact points to lie along a line parallel to the direction of finger
motion. Rather than localizing antipodal contact configurations directly, we will
localize hand configurations where we expect an antipodal grasp to be achieved
in the future when the hand closes. Let $W \subseteq \mathbb{R}^3$ denote the robot
workspace and let $O \subset W$ denote the space occupied by objects or obstacles.
Let $H \subseteq SE(3)$ denote the configuration space of the hand when the fingers
are fully open. We will refer to a configuration $h \in H$ as simply a “hand”. Let
$B(h) \subseteq W$ denote the volume occupied by the hand in configuration
$h \in H$ when the fingers are fully open.
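As an illustration, Definition 1 can be checked numerically for a single pair of contacts. The following is a minimal sketch under stated assumptions, not part of the detection method itself: it assumes a Coulomb friction coefficient `mu` (so that each friction cone has half-angle arctan(mu)) and unit surface normals that point into the object. The function name `is_antipodal_pair` is hypothetical.

```python
import numpy as np


def is_antipodal_pair(p1, n1, p2, n2, mu):
    """Check Definition 1 for two point contacts with friction.

    p1, p2: contact positions. n1, n2: unit surface normals pointing INTO
    the object at each contact (an assumed convention). mu: assumed Coulomb
    friction coefficient; each friction cone has half-angle arctan(mu).
    """
    p1, n1, p2, n2 = (np.asarray(v, dtype=float) for v in (p1, n1, p2, n2))
    line = p2 - p1
    length = np.linalg.norm(line)
    if length == 0.0:
        return False  # coincident contacts do not define a line
    u = line / length
    cos_half_angle = np.cos(np.arctan(mu))
    # The connecting line must lie inside the cone at p1 (axis n1) and,
    # seen from p2, inside the cone at p2 (axis n2).
    inside_cone_1 = np.dot(u, n1) >= cos_half_angle
    inside_cone_2 = np.dot(-u, n2) >= cos_half_angle
    return bool(inside_cone_1 and inside_cone_2)
```

For a parallel jaw gripper, the same check would be combined with the additional requirement that the connecting line be parallel to the direction of finger motion.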

Definition 2. An antipodal hand is a pose of the hand, $h \in H$, such that the
hand is not in collision with any objects or obstacles, $B(h) \cap O = \emptyset$,
and at least one pair of antipodal contacts will be formed when the fingers close
such that the line connecting the two contacts is parallel to the direction of
finger motion.

Algorithm 1 illustrates at a high level our algorithm for detecting antipodal
hands. It takes a point cloud, $\mathcal{C} \subset \mathbb{R}^3$, and a geometric
model of the robot hand as input and produces as output a set of hands,
$\mathcal{H} \subseteq H$, that are predicted to be antipodal. There are two main
steps. First, we sample a set of hand hypotheses. Then, we classify each hypothesis
as an antipodal hand or not. These steps are described in detail in the following
sections.


Algorithm 1 Detect Antipodal Hands

Input: a point cloud, $\mathcal{C}$, and hand parameters, $\theta$
Output: antipodal hands, $\mathcal{H}$
1: $\mathcal{H}_{hyp}$ = Sample-Hands($\mathcal{C}$)
2: $\mathcal{H}$ = Classify-Hands($\mathcal{H}_{hyp}$)


3 Sampling Hands 

A key part of our algorithm is the approach to sampling from the space of hand
hypotheses. A naive approach would be to sample directly from $H \subseteq SE(3)$.
Unfortunately, this would be immensely inefficient because $SE(3)$ is a 6-DOF
space and many hands sampled this way would be far away from any visible
parts of the point cloud. Instead, we define a lower-dimensional sample space
constrained by the geometry of the point cloud.

3.1 Geometry of the Hand and the Object Surface

Before describing the sample space, we quantify certain parameters related to
the grasp geometry. We assume the hand, $h \in H$, is a parallel jaw gripper
composed of two parallel fingers, each modeled as a rectangular prism that moves
parallel to a common plane. Let $a(h)$ denote a unit vector orthogonal to this
plane. The hand is fully specified by the parameter vector
$\theta = (\theta_l, \theta_w, \theta_d, \theta_t)$, where $\theta_l$ and $\theta_w$
denote the length and width of the fingers; $\theta_d$ denotes the distance between
the fingers when fully open; and $\theta_t$ denotes the thickness of the fingers
(orthogonal to the page in Figure 2(a)).

Define the closing region, $R(h) \subseteq W$, to be the volumetric region swept out
by the fingers when they close. Let $r(h) \in R(h)$ denote an arbitrary reference
point in the closing region. Define the closing plane, $C(h)$, to be the subset of
the plane that intersects $r(h)$, is orthogonal to $a(h)$, and is contained within
$R(h)$:

$$C(h) = \{p \in R(h) \mid (p - r(h))^T a(h) = 0\}.$$
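The closing-plane condition can be evaluated directly on sampled points. The sketch below is illustrative only: because points from a sensor almost never satisfy the equality exactly, it assumes a tolerance `tol` that does not appear in the text, and it assumes the input points have already been restricted to the closing region. The function name is hypothetical.

```python
import numpy as np


def closing_plane_points(points, r, a, tol=1e-3):
    """Return the points p with |(p - r)^T a| <= tol.

    points: (N, 3) array assumed to lie already inside the closing region.
    r: reference point of the closing region. a: vector orthogonal to the
    closing plane (normalized here). tol: assumed tolerance, in the same
    units as `points`.
    """
    points = np.asarray(points, dtype=float)
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)
    signed_dist = (points - np.asarray(r, dtype=float)) @ a
    return points[np.abs(signed_dist) <= tol]
```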



We also introduce some notation related to the differential geometry of the
surfaces we are grasping. Recall that each point on a differentiable surface is
associated with a surface normal and two principal curvatures, where each principal
curvature is associated with a principal direction. The surface normal and the two
principal directions define an orthogonal basis known as a Darboux frame.^2 The
Darboux frame at point $p \in \mathcal{C}$ will be denoted
$F(p) = (n(p) \;\; (a(p) \times n(p)) \;\; a(p))$,
where $n(p)$ denotes the unit surface normal and $a(p)$ denotes the direction of
minimum principal curvature at point $p$. Define the cutting plane to be the
plane orthogonal to $a(p)$ that passes through $p$ (see Figure 2(b)). Since we
are dealing with point clouds, it is not possible to measure the Darboux frame
exactly at each point. Instead, we estimate the surface normal and principal
directions over a small neighborhood. We fit a quadratic function over the points
contained within a small ball (3 cm radius in our experiments) using Taubin’s
method [18,19] and use that to calculate the Darboux frame.^3
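To make the frame estimate concrete, the sketch below uses a simpler technique than Taubin's method: the normal is taken from a PCA of the neighborhood and the curvature axis from a least-squares quadratic height function over the tangent plane. It is an assumption-laden stand-in for illustration (it assumes a smooth, roughly height-field-like patch, and it takes "minimum principal curvature" to mean smallest magnitude). The function name is hypothetical.

```python
import numpy as np


def estimate_darboux_frame(neighbors, p):
    """Approximate the Darboux frame at p from an (N, 3) neighborhood.

    Returns a 3x3 matrix with columns (normal, axis x normal, axis), where
    `axis` is the estimated direction of minimum principal curvature.
    Simplification (not Taubin's method): PCA normal plus a least-squares
    quadratic height function.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    p = np.asarray(p, dtype=float)
    centered = neighbors - neighbors.mean(axis=0)
    # eigh returns eigenvalues in ascending order; the direction of least
    # variance approximates the surface normal (its sign is arbitrary).
    _, vecs = np.linalg.eigh(centered.T @ centered)
    n = vecs[:, 0]
    t1 = vecs[:, 2]
    t2 = np.cross(n, t1)
    # Express the neighbors in the local frame (t1, t2, n) centered at p.
    local = neighbors - p
    u, v, h = local @ t1, local @ t2, local @ n
    # Fit h ~ c0*u^2 + c1*u*v + c2*v^2 + c3*u + c4*v + c5.
    design = np.column_stack([u * u, u * v, v * v, u, v, np.ones_like(u)])
    coef, *_ = np.linalg.lstsq(design, h, rcond=None)
    hessian = np.array([[2 * coef[0], coef[1]], [coef[1], 2 * coef[2]]])
    curvatures, dirs = np.linalg.eigh(hessian)
    # Keep the principal direction whose curvature has the smallest magnitude.
    i = int(np.argmin(np.abs(curvatures)))
    axis = dirs[0, i] * t1 + dirs[1, i] * t2
    axis /= np.linalg.norm(axis)
    return np.column_stack([n, np.cross(axis, n), axis])
```

On a patch of a cylinder, for example, this estimate should return a normal along the radial direction and a curvature axis along the cylinder axis, each up to sign.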

^2 Any frame aligned with the surface normal is a Darboux frame. Here we restrict
consideration to the special case where it is also aligned with the principal directions.
^3 Taubin’s method is an analytic solution that performs this fit efficiently by solving
a generalized eigenvalue problem on two $10 \times 10$ matrices [18]. In comparison to
using first-order estimates of surface normal and curvature, the estimates derived
from this quadratic are more robust to local surface discontinuities.

3.2 Hand Sample Set

We want a set that contains many antipodal hands and from which it is easy to
draw samples. The following conditions define the set $\mathcal{H}$. First, for
every hand, $h \in \mathcal{H}$:

Constraint 1. The body of the hand is not in collision with the point cloud:
$B(h) \cap \mathcal{C} = \emptyset$.

Furthermore, there must exist a point in the cloud, $p \in \mathcal{C}$, such that:

Constraint 2. The hand closing plane contains $p$: $p \in C(h)$.

Constraint 3. The closing plane of the hand is parallel to the cutting plane at
$p$: $a(p) = a(h)$.

These three constraints define the following set of hands:

$$\mathcal{H} = \bigcup_{p \in \mathcal{C}} H(p), \quad H(p) = \{h \in H \mid p \in C(h) \wedge a(p) = a(h) \wedge B(h) \cap \mathcal{C} = \emptyset\}. \quad (1)$$

Constraint 3 is essentially a heuristic that limits the hand hypotheses that our
algorithm considers. While this eliminates from consideration many otherwise
good grasps, it is a practical way to focus detection on likely candidates.
Moreover, it is easy to sample from $\mathcal{H}$ by: 1) sampling a point,
$p \in \mathcal{C}$, from the cloud; 2) sampling one or more hands from $H(p)$.
Notice that for each $p \in \mathcal{C}$, $H(p)$ is three-DOF because we have
constrained two DOF of orientation and one DOF of position. This means that
$\mathcal{H}$ is much smaller than $H$ and it can therefore be covered by many
fewer samples.


Algorithm 2 Sample Hands

Input: point cloud, $\mathcal{C}$, hand parameters, $\theta$
Output: grasp hypotheses, $\mathcal{H}$
1: $\mathcal{H} = \emptyset$
2: Preprocess $\mathcal{C}$ (voxelize; workspace limits; etc.)
3: for $i = 1$ to $n$ do
4:   Sample $p \in \mathcal{C}$ uniformly randomly
5:   Calculate $\theta_d$-ball about $p$: $N(p) = \{q \in \mathcal{C} : \|p - q\| \leq \theta_d\}$
6:   Estimate local Darboux frame at $p$: $F(p)$ = Estimate-Darboux($N(p)$)
7:   $H$ = Grid-Search($F(p)$, $N(p)$)
8:   $\mathcal{H} = \mathcal{H} \cup H$
9: end for
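Steps 2 to 5 of Algorithm 2 can be sketched with NumPy alone. This is a minimal illustration under assumptions: voxelization keeps the first point seen in each voxel (other conventions, such as the voxel centroid, are equally plausible), the workspace is an axis-aligned box, and neighborhoods are found by brute force rather than with a KD-tree. All function names are hypothetical.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Keep one point per occupied voxel (the first one encountered)."""
    points = np.asarray(points, dtype=float)
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first_idx)]


def apply_workspace_limits(points, lower, upper):
    """Drop points outside the axis-aligned box [lower, upper]."""
    inside = np.all((points >= lower) & (points <= upper), axis=1)
    return points[inside]


def ball_neighborhood(points, p, radius):
    """N(p) = {q in cloud : ||p - q|| <= radius}, by brute force."""
    return points[np.linalg.norm(points - p, axis=1) <= radius]


def sample_neighborhoods(points, n_samples, radius, rng):
    """Sample points uniformly at random and collect their neighborhoods."""
    idx = rng.integers(0, len(points), size=n_samples)
    return [(points[i], ball_neighborhood(points, points[i], radius))
            for i in idx]
```

A brute-force radius query costs time linear in the cloud size per sample, so for clouds of realistic size a spatial index would be the natural replacement for `ball_neighborhood`.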


The sampling process is detailed in Algorithm 2. First, we preprocess the
point cloud, $\mathcal{C}$, in the usual way by voxelizing (we use voxels 3 mm on a
side in our experiments) and applying workspace limits (Step 2). Second, we
iteratively sample a set of $n$ points ($n$ is between 4000 and 8000 in our
experiments) from the cloud (Step 4). For each point, $p \in \mathcal{C}$, we
calculate a neighborhood, $N(p)$, in the $\theta_d$-ball around $p$ (using a
KD-tree, Step 5). The next step is to estimate the Darboux frame at $p$ by fitting a
quadratic surface using Taubin’s method and calculating the surface normal and
principal curvature directions (Step 6). Next, we sample a set of hand
configurations over a coarse two-DOF grid in a neighborhood about $p$ (Step 7).
Let $h_{x,y,\phi}(p) \in H(p)$ denote the hand at position $(x, y, 0)$ with
orientation $\phi$ with respect to the Darboux frame, $F(p)$. Let $\Phi$ denote a
discrete set of orientations (8 in our implementation). Let $X$ denote a discrete
set of hand positions (20 in our implementation). For each hand configuration
$(\phi, x) \in \Phi \times X$, we calculate the hand configuration furthest along
the $y$ axis that remains collision free: $y^* = \max_{y \in Y} y$ such that
$B(h_{x,y,\phi}) \cap N(p) = \emptyset$, where $Y = [-\theta_d, \theta_d]$
(Step 3 of Algorithm 3). Then, we check whether the closing plane for this hand
configuration contains points in the cloud (Step 4 of Algorithm 3). If it does,
then we add the hand to the hypotheses set (Step 5 of Algorithm 3).


Algorithm 3 Grid Search

Input: neighborhood point cloud, $N$; Darboux frame, $F$
Output: neighborhood grasp hypotheses, $H$
1: $H = \emptyset$
2: for all $(\phi, x) \in \Phi \times X$ do
3:   Push hand until collision: $y^* = \max_{y \in Y} y$ such that $B(h_{x,y,\phi}) \cap N = \emptyset$
4:   if closing plane not empty: $C(h_{x,y^*,\phi}) \cap N \neq \emptyset$ then
5:     $H = H \cup h_{x,y^*,\phi}$
6:   end if
7: end for
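The grid search can be illustrated in the plane. The sketch below is a simplified, hypothetical model rather than the implementation behind Algorithm 3: neighborhood points are assumed to be already projected into two dimensions, the hand is a planar outline (two finger rectangles plus a base rectangle of assumed depth `base_depth`), the offsets x and y are applied in the frame rotated by the orientation, and "push until collision" is read as advancing through increasing y values and stopping at the first collision.

```python
import numpy as np


def hand_collides(pts, y, width, opening, length, base_depth):
    """Planar hand model in hand coordinates; the fingers point along +y.

    The base spans y - base_depth <= py <= y across the full hand width;
    the fingers span y <= py <= y + length on either side of the gap.
    """
    px, py = pts[:, 0], pts[:, 1]
    half = opening / 2.0
    in_base = (np.abs(px) <= half + width) & (py >= y - base_depth) & (py <= y)
    in_finger = ((np.abs(px) >= half) & (np.abs(px) <= half + width)
                 & (py >= y) & (py <= y + length))
    return bool(np.any(in_base | in_finger))


def closing_region_nonempty(pts, y, opening, length):
    """True if any point lies strictly between the fingers."""
    px, py = pts[:, 0], pts[:, 1]
    inside = (np.abs(px) < opening / 2.0) & (py >= y) & (py <= y + length)
    return bool(np.any(inside))


def grid_search(pts2d, phis, xs, ys, width, opening, length, base_depth):
    """Return (phi, x, y*) for every grid cell that yields a hypothesis."""
    hypotheses = []
    for phi in phis:
        c, s = np.cos(phi), np.sin(phi)
        rot = np.array([[c, s], [-s, c]])  # rotates points by -phi
        rotated = pts2d @ rot.T
        for x in xs:
            local = rotated - np.array([x, 0.0])
            y_star = None
            for y in sorted(ys):  # push forward until the first collision
                if hand_collides(local, y, width, opening, length, base_depth):
                    break
                y_star = y
            if y_star is not None and closing_region_nonempty(
                    local, y_star, opening, length):
                hypotheses.append((phi, x, y_star))
    return hypotheses
```

Stopping at the first collision differs from taking the largest collision-free y over the whole interval when the hand could pass through a gap and become collision-free again; the sketch uses the physically motivated reading.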


3.3 Grasping Results 

Interestingly, our experiments indicate that this sampling method by itself can
be used to do grasping. In Algorithm 1, the sampling process is followed by the
grasp classification process described in the next section. However, if we omit
classification, implicitly assuming that all grasp hypotheses are true grasps, we
obtain a surprisingly high grasp success rate of approximately 73% (the column
labeled NC, 2V in Figure 7). The experimental context of this result is described
in Section 5. Essentially, we cluster the sampled hands and use a heuristic grasp
selection strategy to choose a grasp to execute (see Section 5.1). This result is
surprising because the sampling constraints (Constraints 1-3) encode relatively
simple geometric conditions. It suggests that these sampling constraints are an
important part of our overall grasp success rates.

4 Classifying Hand Hypotheses 

After generating hand hypotheses, the next step is to classify each of those
hypotheses as antipodal or not. The simplest approach would be to infer object
surface geometry from the point cloud and then check which hands satisfy
Definition 2. Unfortunately, since most real-world point clouds are partial, many
hand hypotheses will fail this check simply because not all relevant object
surfaces were visible to a sensor. Instead, we infer which hypotheses are likely to
be antipodal using machine learning (i.e., classification).

4.1 Labeling Grasp Hypotheses 

Many approaches to grasp point detection require large amounts of training data
where humans have annotated images with good grasp points [16,7,12,3,4,6].
Unfortunately, obtaining these labels is challenging because it can be hard for
human labelers to predict what object surfaces in a scene might be graspable for
a robot. Instead, our method automatically labels a set of training images by
checking a relaxed version of the conditions of Definition 2.

In order to check whether a hand hypothesis, $h \in \mathcal{H}$, is antipodal,
we need to determine whether an antipodal pair of contacts will be formed when
the hand closes. Let $f(h)$ denote the direction of closing of one finger. (In a
parallel jaw gripper, the other finger closes in the opposite direction.) When
the fingers close, they will make first contact with an extremal pair of points,
$s_1, s_2 \in R(h)$, such that
$\forall s \in R(h),\; s_1^T f(h) \geq s^T f(h) \wedge s_2^T f(h) \leq s^T f(h)$. An
antipodal hand requires two such extremal points to be antipodal and the line
connecting the points to be parallel to the direction of finger closing. In
practice, we relax this condition slightly as follows. First, rather than checking
for extremal points, we check for points that have a surface normal parallel to
the direction of closing. This is essentially a first-order condition for an
extremal point that is more robust to outliers in the cloud. The second way that
we relax Definition 2 is to drop the requirement that the line connecting the two
contacts be parallel to the direction of finger closing and to substitute a
requirement that at least $k$ points are found with an appropriate surface normal.
Again, the intention here is to make detection more robust: if there are at least
$k$ points near each finger with surface normals parallel to the direction of
closing, then it is likely that the line connecting at least one pair will be
nearly parallel to the direction of finger closing. In summary, we check
whether the following definition is satisfied:

Definition 3. A hand, $h \in H$, is near antipodal for thresholds $k \in \mathbb{N}$
and $\theta \in [0, \pi/2]$ when there exist $k$ points
$p_1, \ldots, p_k \in R(h) \cap \mathcal{C}$ such that $n(p_i)^T f(h) \geq \cos\theta$
and $k$ points $q_1, \ldots, q_k \in R(h) \cap \mathcal{C}$ such that
$n(q_i)^T f(h) \leq -\cos\theta$.

Fig. 3. Our robot has stereo RGBD sensors.

When Definition 3 is satisfied, we label the corresponding hand a positive
instance. Note that in order to check for this condition, it is necessary to register
at least two point clouds produced by range sensors that have observed the
scene from different perspectives (Figure 3). This is because we need to “see”
two nearly opposite surfaces on an object. Even then, many antipodal hands
will not be identified as such because only one side of the object is visible.
These “indeterminate” hands are omitted from the training set. In some cases,
it is possible to verify that a particular hand is not antipodal by checking that
there are fewer than $k$ points in the hand closing region that satisfy either of
the conditions of Definition 3. These hands are included in the training set as
negative examples. This assumes that the closing region of every sampled hand
hypothesis is at least partially visible to a sensor. If there are fewer than $k$
satisfying points, then Definition 3 would not be satisfied even if the opposite
side of an object was observed. In our experiments, we set the thresholds $k = 6$
and $\theta = 20$ degrees.
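The three-way labeling rule can be written compactly. The sketch below is one reading of the text and should be treated as an assumption: a hand is negative when the total number of points satisfying either normal condition is below k (an alternative reading would test each side separately). The function name is hypothetical; the defaults follow the thresholds stated above.

```python
import numpy as np


def label_hand(normals, f, k=6, theta_deg=20.0):
    """Return 'positive', 'negative', or 'indeterminate' for one hand.

    normals: (N, 3) unit surface normals of the cloud points inside the
    closing region. f: unit closing direction of one finger.
    """
    proj = np.asarray(normals, dtype=float) @ np.asarray(f, dtype=float)
    cos_t = np.cos(np.radians(theta_deg))
    n_aligned = int(np.sum(proj >= cos_t))    # normals along +f
    n_opposed = int(np.sum(proj <= -cos_t))   # normals along -f
    if n_aligned >= k and n_opposed >= k:
        return "positive"       # near antipodal: both conditions hold
    if n_aligned + n_opposed < k:
        return "negative"       # too few candidate contact points overall
    return "indeterminate"      # possibly antipodal, but not verifiable
```

Only hands labeled positive or negative would enter the training set; indeterminate hands are dropped.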


4.2 Feature Representation 


In order to classify hand hypotheses, a feature descriptor is needed.
Specifically, for a given hand $h \in \mathcal{H}$, we need to encode the geometry
of the points contained within the hand closing region, $\mathcal{C} \cap R(h)$. A
variety of relevant descriptors have been explored in the literature [10,15,20].
In our case, we achieve good performance using a simple descriptor based on HOG
features. For a point cloud, $\mathcal{C}$, a two-dimensional image of the closing
region is created by projecting the points $\mathcal{C} \cap R(h)$ onto the hand
closing plane: $I(\mathcal{C}, h) = S_{12} F(h)^T (N \cap C(h))$, where
$S_{12} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ selects the first two
rows of $F(h)^T$. We call this the grasp hypothesis image. We encode it using the
HOG descriptor, $HOG(I(\mathcal{C}, h))$. In our implementation, we chose a HOG
cell size such that the grasp hypothesis image was covered by $10 \times 12$ cells
with a standard $2 \times 2$ block size.
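The projection step can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: points are shifted by an `origin` before projection, the projected coordinates are rasterized into a binary occupancy image, and the image size and extent are free parameters. The function name is hypothetical.

```python
import numpy as np


def grasp_hypothesis_image(points, frame, origin, extent, shape=(60, 60)):
    """Project 3-D points with S12 F^T and rasterize them to a binary image.

    points: (N, 3) cloud points selected for the hand. frame: 3x3 matrix
    whose columns are the hand frame axes. origin: 3-vector subtracted
    before projecting (an assumption). extent: (half_width, half_height)
    of the imaged area, in cloud units. shape: (rows, cols), assumed.
    """
    s12 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    shifted = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    uv = (s12 @ frame.T @ shifted.T).T
    rows, cols = shape
    # Map [-extent, extent] to pixel indices and discard points outside.
    c = np.floor((uv[:, 0] + extent[0]) / (2 * extent[0]) * cols).astype(int)
    r = np.floor((uv[:, 1] + extent[1]) / (2 * extent[1]) * rows).astype(int)
    keep = (c >= 0) & (c < cols) & (r >= 0) & (r < rows)
    image = np.zeros(shape, dtype=np.uint8)
    image[r[keep], c[keep]] = 1
    return image
```

The resulting image could then be passed to any HOG implementation, with the cell size chosen so that the image is covered by the stated number of cells.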



Fig. 4. HOG feature representation of a hand hypothesis for the box shown in
Figure (b).


4.3 Creating the Training Set 

In order to create the training set, we obtain a set of objects that have local
geometries similar to what might be expected in the field. In our work, we
selected the set of 18 objects shown in Figure 5(a). Each object was placed in
front of the robot in two configurations: one upright configuration and one on
its side. For each configuration (36 configurations total), let $\mathcal{C}_1$ and
$\mathcal{C}_2$ denote the voxelized point clouds obtained from each of the two
sensors, respectively, and let $\mathcal{C}_{12} = \mathcal{C}_1 \cup \mathcal{C}_2$
denote the registered two-view cloud.

Fig. 5. (a) Training set comprised of 18 objects. (b-d) Illustration of the three
grasp hypothesis images incorporated into the training set per hand. The blue
triangles at the bottom denote positions of the two range sensors. (c-d) illustrate
training images created using data from only one sensor.

The training data is generated as follows. First, we extract hand hypotheses
from the registered cloud, $\mathcal{C}_{12}$, using the methods of Section 3.
Second, for each $h \in \mathcal{H}$, we determine whether it is positive, negative,
or indeterminate by checking the conditions of Definition 3. Indeterminate hands
are discarded from training. Third, for each positive or negative hand, we extract
three feature descriptors: $HOG(I(\mathcal{C}_1, h))$, $HOG(I(\mathcal{C}_2, h))$,
and $HOG(I(\mathcal{C}_{12}, h))$. Each descriptor is given the same label and
incorporated into the training set. Over our 18-object training set, this procedure
generated approximately 6500 positive and negative labeled examples that were used
to train an SVM. We only did one round of training using this single training set.

The fact that we extract three feature descriptors per hand in step three
above is important because it helps us to capture the appearance of partial
views in the training set. Figure 5(b-d) illustrates the three descriptors for an
antipodal hand. Even though the closing region of this hand is relatively well
observed in $\mathcal{C}_{12}$, the fact that we incorporate
$HOG(I(\mathcal{C}_1, h))$ and $HOG(I(\mathcal{C}_2, h))$ into the dataset means
that we are emulating what would have been observed if we only had a partial view.
This makes our method much more robust to partial point cloud information.

4.4 Cross Validation 

We performed cross validation on a dataset derived from the 18 training objects
shown in Figure 5(a). For each object, we obtained a registered point cloud for
two configurations (total of 36 configurations). Following the procedure described
in this section, we obtained 6500 labeled features with 3405 positives and 3095
negatives. We did 10-fold cross validation on this dataset using an SVM for
the various Gaussian and polynomial kernels available in Matlab. We obtained
97.8% accuracy using a degree-three polynomial kernel and used this kernel in
the remainder of our experiments. In the cross validation experiment described
above, the folds were random across the labeled pairs in the dataset. This does
not capture the effects of experiencing novel objects or the expected performance
when only single-view point clouds are available. Therefore, we did the following.
First, we trained the system using the degree-three polynomial kernel on the 6500
labeled examples as described above. Then, we obtained additional single-view
point clouds for each of the 30 novel test objects shown in Figure 6 (each object
was presented in isolation) for a total of 122 single-view point clouds. We used
the methods described in this section to obtain ground truth for this dataset.
This gave us a total of 7250 labeled single-view hypotheses on novel objects with
1130 positives and 6120 negatives. We obtained 94.3% accuracy on this dataset.
The fact that we do relatively well in these cross validation experiments using a
relatively simple feature descriptor and without mining hard negatives suggests
that our approach to sampling hands and creating the grasp hypothesis image makes
the grasp classification task easier than it is in approaches that do not use this
kind of structure [16,7,3,4,12].
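As a point of reference for the two accuracy figures, the class balance of each dataset follows directly from the counts reported above. The majority-class baseline mentioned afterwards is an illustrative comparison computed from those counts, not a reported experimental result.

```latex
\begin{align*}
\text{training set:}\quad & \frac{3405}{6500} \approx 52.4\%\ \text{positive}, & \frac{3095}{6500} \approx 47.6\%\ \text{negative},\\
\text{novel-object set:}\quad & \frac{1130}{7250} \approx 15.6\%\ \text{positive}, & \frac{6120}{7250} \approx 84.4\%\ \text{negative}.
\end{align*}
```

A classifier that always predicted the majority class would therefore score about 52.4% on the first dataset and about 84.4% on the second, compared with the reported 97.8% and 94.3%.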



Fig. 6. The 30 objects in our test set.


5 Robot Experiments 

We evaluated the performance of our algorithms using the Baxter robot from
Rethink Robotics. We explore two experimental settings: when objects are
presented to the robot in isolation and when objects are presented in a dense
clutter scenario. We use the Baxter right arm equipped with the stock two-finger
Baxter gripper. A key constraint of the Baxter gripper is the limited finger stroke:
each finger has only 2 cm stroke. In these experiments, we adjust the finger
positions such that they are 3 cm apart when closed and 7 cm apart when open. This
means we cannot grasp anything smaller than 3 cm or larger than 7 cm. We
chose each object in the training and test sets so that it could be grasped
under these constraints. Two-view registered point clouds were created using Asus
Xtion Pro range sensors (see Figure 3). It should be possible for anyone with a
Baxter robot and the appropriate depth sensors to replicate any of these
experiments by running our ROS package at http://wiki.ros.org/agile_grasp.


5.1 Grasp Selection 

Since our algorithm typically finds tens or hundreds of potential antipodal hands, 
depending upon the number of objects in the scene, it is necessary to select one 
to execute. One method might be to select a grasp on an object of interest. 
However, in this paper, we ignore object identity and perform any feasible grasp. 
We choose a grasp to attempt as follows. First, we sparsify the set of grasp 
choices by clustering antipodal hands based on distance and orientation. Grasp 
hypothesis that are nearby each other and that are roughly aligned in orientation 
are grouped together. Each cluster must be composed of a specified minimum 
number of constituent grasps. If a cluster is found, then we create a new grasp 
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Object 

number 
of poses 

Succ. Rate 
A, 2V 


number 
of poses 

Success Rate 

NC, IV 

NC, 2V 

SVM, IV 

SVM, 2V 

Plush drill 

3 

100.00% 


6 

50.00% 

66.67% 

100.00 

66.67% 

Black pepper 

3 

100.00% 


8 

62.5% 

62.50% 

75.00 

100.00% 

Dremel engraver 

3 

100.00% 


6 

33.33% 

50.00% 

66.67 

100.00% 

Sand castle 

3 

100.00% 


6 

50.00% 

33.33% 

83.33 

83.33% 

Purple ball 

0 

NA 


6 

66.67% 

100.00% 

83.33 

100.00% 

White yarn roll 

3 

100.00% 


8 

87.50% 

87.50% 

87.50 

75.00% 

Odor protection 

0 

NA 


8 

50.00% 

87.50% 

87.50 

75.00% 

Neutrogena box 

3 

66.67% 


8 

25.00% 

87.50% 

87.50 

87.50% 

Plush screwdriver 

3 

100.00% 


6 

83.33% 

87.50% 

83.33 

100.00% 

Toy banana box 

3 

100.00% 


8 

100% 

83.33% 

87.50 

75.00% 

Rocket 

3 

100.00% 


8 

50.00% 

87.50% 

100.00 

87.50% 

Toy screw 

3 

100.00% 


6 

100.00% 

100.00% 

83.33 

100.00% 

Lamp 

3 

100.00% 


8 

62.50% 

83.33% 

87.50 

87.50% 

Toothpaste box 

3 

66.67% 


8 

87.50% 

100.00% 

87.50 

87.50% 

White squirt bottle 

3 

66.67% 


8 

25.00% 

12.50% 

75.00 

87.50% 

White rope 

3 

100.00% 


6 

66.67% 

83.33% 

83.33 

100.00% 

Whiteboard cleaner 

3 

100.00% 


8 

62.50% 

75.00% 

100.00 

100.00% 

Toy train 

0 

NA 


8 

87.50% 

100.00% 

87.50 

100.00% 

Vacuum part 

3 

100.00% 


6 

33.33% 

66.67% 

100.00 

83.33% 

Computer mouse 

0 

NA 


6 

33.33% 

33.33% 

66.67 

83.33% 

Vacuum brush 

1 

100% 


6 

50.00% 

83.33% 

66.67 

50.00% 

Lint roller 

3 

100.00% 


8 

75.00% 

75.00% 

87.50 

100.00% 

Ranch seasoning 

3 

100.00% 


8 

50.00% 

75.00% 

100.00 

100.00% 

Red pepper 

3 

100.00% 


8 

75.00% 

75.00% 

100.00 

100.00% 

Crystal light 

3 

100.00% 


8 

25.00% 

37.50% 

75.00 

75.00% 

Red thread 

3 

100.00% 


8 

75.00% 

100.00% 

100.00 

100.00% 

Kleenex 

3 

100.00% 


6 

33.33% 

33.33% 

83.33 

83.33% 

Lobster 

3 

66.67% 


6 

16.67% 

83.33% 

66.67 

83.33% 

Boat 

3 

100.00% 


6 

83.33% 

100.00% 

83.33 

100.00% 

Blue squirt bottle 

2 

100% 


8 

25.00% 

50.00% 

75.00 

62.50% 


[Average | | 94.67% | | | 57.50% | 72.92% | 85.00% | 87.78% | 


Fig. 7. Single object experimental results. Algorithm variations are denoted as: A for 
antipodal grasps (see Section 4.1), NC for sampling without grasp classification (see 
Section 1^, and SVM for our full detection system. 


hypothesis positioned at the mean of the cluster and oriented with the “average” 
orientation of the constituent grasps. The next step is to select a grasp based on 
how easily it can be reached by the robot. First, we solve the inverse kinematics 
(IK) for each of the potential grasps and discard those for which no solution 
exists. The remaining grasps are ranked according to three criteria: 1) distance
from arm joint limits (a piecewise function that is zero far from the arm joint
limits and quadratic near the limits); 2) distance from hand joint limits (zero far
from the limits and quadratic near the limits); 3) workspace distance traveled by
the hand starting from a fixed pre-grasp arm configuration. These three criteria
are minimized in order of priority: first we select the set of grasps that minimize
Criterion #1. Of those, we select those that minimize Criterion #2. Of those,
we select the one that minimizes Criterion #3 as the grasp to be executed by
the robot.
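
The prioritized selection described above can be sketched as a lexicographic
filter. This is a minimal illustration only, not the code in the ROS package:
the names GraspCandidate, joint_limit_penalty, and select_grasp, the margin
parameter, and the tie tolerance tol are all assumptions made for this example,
and the candidates are assumed to have already passed the IK check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GraspCandidate:
    """Hypothetical container for one IK-feasible grasp and its three costs."""
    name: str
    arm_limit_cost: float    # Criterion #1: penalty near arm joint limits
    hand_limit_cost: float   # Criterion #2: penalty near hand joint limits
    travel_distance: float   # Criterion #3: workspace distance from pre-grasp


def joint_limit_penalty(q: float, lower: float, upper: float, margin: float) -> float:
    """Piecewise cost for one joint: zero far from the limits, quadratic
    once the joint is within `margin` (an assumed parameter) of a limit."""
    closest = min(q - lower, upper - q)  # distance to the nearest limit
    if closest >= margin:
        return 0.0
    return (margin - closest) ** 2


def select_grasp(candidates: List[GraspCandidate],
                 tol: float = 1e-9) -> Optional[GraspCandidate]:
    """Keep the minimizers of Criterion #1, then of Criterion #2,
    then return the single minimizer of Criterion #3."""
    if not candidates:
        return None
    best_arm = min(c.arm_limit_cost for c in candidates)
    pool = [c for c in candidates if c.arm_limit_cost <= best_arm + tol]
    best_hand = min(c.hand_limit_cost for c in pool)
    pool = [c for c in pool if c.hand_limit_cost <= best_hand + tol]
    return min(pool, key=lambda c: c.travel_distance)


if __name__ == "__main__":
    grasps = [
        GraspCandidate("a", 0.0, 0.0, 0.5),
        GraspCandidate("b", 0.0, 0.0, 0.3),
        GraspCandidate("c", 0.1, 0.0, 0.1),  # shortest travel, but near an arm limit
    ]
    print(select_grasp(grasps).name)
```

Because the joint-limit costs are exactly zero away from the limits, many
candidates typically tie on Criteria #1 and #2, so in this sketch the travel
distance often decides the final choice.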

5.2 Objects Presented in Isolation 

We performed a series of experiments to evaluate how well various parts of our
algorithm perform in the context of grasping each of the 30 test set objects.
Each object was presented to the robot in isolation on a table in
front of the robot. We characterize three variations on our algorithm:

1. No Classification: We assume that all hand hypotheses generated by the
sampling algorithm are antipodal and pass all hand samples, without
classification, directly to the grasp selection mechanism described
in Section 5.1.

2. Antipodal: We classify hand hypotheses by evaluating the conditions of
the antipodal grasp definition directly for each hand and pass the results to
grasp selection.

3. SVM: We classify hand hypotheses using the SVM and pass the results to
grasp selection. The system was trained using the 18-object training set.


In all scenarios, a grasp trial was considered a success only when the robot
successfully localized, grasped, lifted, and transported the object to a box on
the side of the table. We evaluate No Classification and SVM for single-view
and two-view registered point clouds over 214 grasps of the 30 test objects.
Each object was placed in between 6 and 8 systematically different orientations
relative to the robot.
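
As a reading aid for the results table, the short sketch below shows how a
per-object success rate follows from the number of successful trials and the
number of poses. The helper name success_rate is an assumption made for this
example; with 6 or 8 poses, the reported percentages are multiples of 1/6 or 1/8.

```python
def success_rate(successes: int, trials: int) -> float:
    """Percentage of successful grasp trials, rounded to two decimals."""
    if trials <= 0:
        raise ValueError("an object with no trials has no success rate (NA)")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return round(100.0 * successes / trials, 2)


if __name__ == "__main__":
    # List every rate that can occur with 6 poses and with 8 poses.
    for trials in (6, 8):
        print(trials, [success_rate(k, trials) for k in range(trials + 1)])
```

For example, 7 successes in 8 poses gives 87.50%, and 5 successes in 6 poses
gives 83.33%.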

Figure 7 shows the results. The results for No Classification are shown in
columns NC, 1V and NC, 2V. Column NC, 1V shows that with a point cloud
created using only one depth sensor, using the results of sampling with no
additional classification results in an average grasp success rate of 58%. However,
as shown in Column NC, 2V, it is possible to raise this success rate to 73% just
by adding a second depth sensor and using the resulting two-view registered
cloud. The fact that we obtain a grasp success rate as high as 73% here is
surprising considering that the sampling strategy employs rather simple geometric
constraints. This suggests that even simple geometric constraints can improve
grasp detection significantly. The results for Antipodal are shown in the column
labeled A, 2V. We did not evaluate this variation for a one-view cloud because
the antipodal definition needs a two-view cloud to find any near-antipodal hands.
Compared to the other two approaches, Antipodal finds relatively few positives.
This is because this method needs to “see” two sides of a potential grasp
surface in order to verify the presence of a grasp. As a result, we were only able
to evaluate this method over three poses per object instead of six or eight. In
fact, Antipodal failed to find any grasps at all for four of the 30 objects. Overall,
Antipodal can be an effective way to detect grasps (94.7% grasp success rate),
but since it is not robust to occlusions at all, it is not very useful in practice.
The results for SVM are shown in columns SVM, 1V and SVM, 2V (results
for one-view and two-view point clouds, respectively). Interestingly, there is not
much advantage here to adding a second depth camera: we achieve an 85.0%
success rate with a one-view point cloud and an 87.8% success rate with a
two-view registered cloud. Drilling down into these numbers, we find the following
three major causes of grasp failure: 1) approximately 5.6% of the grasp failure
rate in both scenarios is due to collisions between the gripper and the object
caused by arm calibration errors or collisions with observed or unobserved parts
of the environment; 2) approximately 3.5% of the objects were dropped after a
successful initial grasp; 3) approximately 2.3% of grasp failures in the two-view
case (3.7% in the one-view case) were caused by perceptual errors. The striking
thing about the causes of failure listed above is that they are not all perceptual
errors: if we want to improve beyond the 87.8% success rate, we need to improve
performance in multiple areas.
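
The failure figures above can be checked against the overall success rates with
a little arithmetic. The sketch below assumes, as an interpretation rather than
something the text states outright, that each listed percentage is a share of
all grasp attempts; the function name unattributed_failure is hypothetical.

```python
from typing import Sequence


def unattributed_failure(success_pct: float, cause_pcts: Sequence[float]) -> float:
    """Failure percentage left over after subtracting the listed causes."""
    total_failure = 100.0 - success_pct
    return round(total_failure - sum(cause_pcts), 1)


if __name__ == "__main__":
    # Listed causes: collisions, dropped objects, perceptual errors.
    two_view = unattributed_failure(87.8, [5.6, 3.5, 2.3])
    one_view = unattributed_failure(85.0, [5.6, 3.5, 3.7])
    print("two-view remainder:", two_view)
    print("one-view remainder:", one_view)
```

Under that assumption, the three major causes account for most but not all of
the failures in both cases, which is consistent with the text calling them the
major causes rather than an exhaustive list.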

In the experiments described above, we eliminated seven objects from the test
set because they were hard to see with our depth sensor (Asus Primesense) due to
specularity, transparency, or color. We characterized grasp performance for these
objects separately by grasping each of them in eight different poses (a total of
56 grasps over all seven objects). Using SVM, we obtain a 66.7% grasp success
rate using a single-view point cloud and an 83.3% grasp success rate when a
two-view cloud is used. This result suggests: 1) our 87.8% success rate drops
to 83% for hard-to-see objects; 2) creating a more complete point cloud by
adding additional sensors is particularly important in non-ideal viewing
conditions.

5.3 Objects Presented in Dense Clutter 

We also characterized our algorithm in dense clutter as illustrated in Figure 9.
We created a test scenario where ten objects are piled together in a shallow box.
We used exactly the same algorithm (i.e., SVM) in this experiment as in the
isolated object experiments. We used a two-view registered point cloud in all 
cluttered scenarios. The 27 objects used in this experiment are a subset of the 
30 objects used in the single object experiments. We eliminated the computer 
mouse and the engraver because they have cables attached to them that can get 
stuck in the clutter. We also removed the vacuum brush because the brush part 
cannot be grasped by the Baxter gripper in some configurations due to the 3-7 
cm aperture limits. At the beginning of each run, we randomly selected 10 out 
of the 27 objects and placed them in a small rectangular container. We then 
shook the container to mix up the items and emptied it into the shallow box 
on top of the table. We excluded all runs where the sandcastle landed upside 
down because the Baxter gripper cannot grasp it in that configuration. A run 
was terminated when three consecutive localization failures occurred. In total, 
we performed 10 runs of this experiment. 
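
The run protocol described above can be summarized in a short simulation
sketch. It is an illustration under stated assumptions, not the experiment
code: detect_grasp and execute_grasp are hypothetical callbacks standing in for
the perception and robot execution steps, and max_attempts is a safety cap
added here that is not part of the described protocol.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def run_clutter_trial(
    object_pool: Sequence[str],
    detect_grasp: Callable[[List[str]], Optional[str]],
    execute_grasp: Callable[[str], bool],
    rng: random.Random,
    n_objects: int = 10,
    max_consecutive_failures: int = 3,
    max_attempts: int = 1000,  # assumed safety cap, not in the protocol
) -> Tuple[int, int, int]:
    """Simulate one run; returns (grasp attempts, successes, objects left)."""
    # Randomly select the objects that are piled into the box.
    scene = rng.sample(list(object_pool), n_objects)
    attempts = successes = misses = 0
    while scene and misses < max_consecutive_failures and attempts < max_attempts:
        target = detect_grasp(scene)
        if target is None:
            misses += 1      # localization failure
            continue
        misses = 0           # only consecutive failures end the run
        attempts += 1
        if execute_grasp(target):
            successes += 1
            scene.remove(target)
    return attempts, successes, len(scene)


if __name__ == "__main__":
    pool = [f"object_{i}" for i in range(27)]
    # Toy stand-ins: detection fails once two objects remain; grasps always succeed.
    result = run_clutter_trial(
        pool,
        detect_grasp=lambda scene: scene[0] if len(scene) > 2 else None,
        execute_grasp=lambda target: True,
        rng=random.Random(0),
    )
    print(result)
```

In this sketch a run ends either when the box is empty or after three
localization failures in a row, which mirrors the termination rule used in the
experiment.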

Over all 10 runs of this experiment, the robot performed 113 grasps. On
average, it succeeded in removing 85% of the objects from each box. The remaining
objects were not grasped because the system failed to localize a grasp point three
times in a row. Over all grasp attempts, 73% succeeded. The 27% failure rate
breaks down into the following major failure modes: 3% due to arm calibration
errors; 9% due to perceptual errors; 4% due to dropped objects following a
successful grasp; and 4% due to collision with the environment. In comparison with
the isolation results, these results have a significantly higher perceptual failure
rate. We believe this is mainly due to the extensive occlusions in the clutter
scenario.

Fig. 8. Hard-to-see objects.

Fig. 9. Dense clutter scenario. (a) RGB image, (b) output of our algorithm.


6 Conclusion 

This paper proposes a new approach to localizing grasp points on novel objects
presented in clutter. Our main idea is to improve detection by using geometric
knowledge about good grasps. We first create a large set of high quality
grasp hypotheses by drawing samples that satisfy simple, geometrically necessary
conditions. We then use the geometry of an antipodal grasp to create a large
automatically labeled training set that enables us to achieve high classification
accuracy using an SVM. If we omit the classification phase of this algorithm and
consider all samples to be good grasps, then we achieve an average grasp success
rate of 73% when grasping objects presented in isolation. This success rate is
surprisingly high because the sampling process only checks very simple necessary
conditions on the presence of a grasp. It suggests that our proposed
geometry-based sampling method is very effective. The average success rate increases to
87.8% when the sampled hypotheses are classified as antipodal grasps using an
SVM. When grasping novel objects presented in dense clutter, the success rate
drops to 73% as a result of extensive occlusions. The fact that performance drops
so significantly in dense clutter suggests that it is important to study the
perceptual challenges unique to dense clutter grasp scenarios. This paper is one of
the first to propose a systematic way of measuring grasp performance in dense
clutter. We hope to expand on this analysis of dense clutter in the future. This
system is available as a ROS package at http://wiki.ros.org/agile_grasp.